Abstract: Motivated by recommendation systems, we consider
the problem of estimating block constant binary matrices (of size
$m \times n$) from sparse and noisy observations. The observations
are obtained from the underlying block constant matrix after
unknown row and column permutations, erasures, and errors.
We derive upper and lower bounds on the achievable probability
of error. For fixed erasure and error probability, we show that
there exists a constant $C_1$ such that if the cluster sizes are less
than $C_1 \ln(mn)$, then for any algorithm the probability of error
approaches one as $mn \to \infty$. On the other hand, we show that
a simple polynomial time algorithm gives probability of error
diminishing to zero provided the cluster sizes are greater than
$C_2 \ln(mn)$ for a suitable constant $C_2$.

I. Introduction 

Recommender systems are commonly used to suggest con- 
tent (movies, books, etc.) that is relevant to a given buyer. The 
most common approach is to predict the rating that a potential 
buyer might assign to an item and use the predicted ratings to 
recommend items. The problem thus reduces to completion of 
the rating matrix based on a sparse set of observations. This 
problem has been popularized by the Netflix Prize ([1]). A 
number of methods have been suggested to solve this problem; 
see for example [2], [3], [4] and references therein. Recently,
several authors ([5], [6], [7]) have used the assumption of a
low-rank rating matrix to propose provably good algorithms. 
For example, in [5], [6], a "compressed sensing" approach 
based on nuclear-norm minimization is proposed. It is shown 
in [6] that if the number of samples is larger than a lower 
bound (depending on the matrix size and rank), then with 
high probability, the proposed optimization problem exactly 
recovers the underlying low-rank matrix from the samples. In 
[7], the relationship between the "fit-error" and the prediction 
error is studied for large random matrices with bounded rank. 
An efficient algorithm for matrix completion is also proposed. 

In this paper, we consider a different setup. We assume that 
there is an underlying "true" rating matrix, which has block 
constant structure. In other words, buyers (respectively items) 
are clustered into groups of similar buyers (respectively items), 
and similar buyers rate similar items by the same value. The 
observations are obtained from this underlying matrix (say M) 
as described below. 



1) The rows and columns of M are permuted with un- 
known permutations, that is, the clusters are not known. 

2) Many entries of M are erased by a memoryless era- 
sure channel. This models the sparsity of the available 
ratings. 

3) The non-erased entries are observed through a discrete 
memoryless channel (DMC). This channel models 

• the residual error in the block constant model, and
• the "noisy" behavior of buyers who may rate the 
same item differently at different times. 

One may also treat these two channels as a single effective 
DMC, but we prefer the above break-up for conceptual rea- 
sons. Our goal is to identify conditions on the cluster sizes 
under which the underlying matrix can be recovered with small 
probability of error. Our recommendation system model differs 
from [5], [6], and in particular, we do not seek completion of 
the observed matrix, but rather the recovery of the underlying 
M. As described above, our goal reduces to analyzing the 
error performance of the code of block-constant matrices over 
the channel described above. 

From a practical stand-point, it is desirable to consider the 
case when the parameters of the erasure channel and DMC are 
not known. However, in this paper, we consider the simpler 
case when these channel parameters are known. In particular, 
for simplicity, we consider the case when M is an m x n matrix 
with entries in {0, 1} and the DMC is a binary symmetric 
channel (BSC) with error probability $p$. The erasure probability
is $\epsilon$. Our main results are of the following nature.

• If the "largest cluster size" (defined precisely in Section
III-A) is less than $C_1 \ln(mn)$, then the probability of error
approaches unity for any estimator of $M$ as $mn \to \infty$
(Corollary 2, Part 2).

• We analyze a simple algorithm, which clusters rows and
columns first, and then estimates the cluster values. We
show that if the "smallest cluster size" is greater than a
constant multiple of $\ln(mn)$, then the probability of error
for this algorithm (averaged over the rating matrices)
approaches zero as $mn \to \infty$ (Theorem 3, Part 2).
Combined with the previous result, this implies that $\ln(mn)$
is a sharp threshold for exact recovery asymptotically.



• If we consider the probability of error for a fixed
rating matrix, then the algorithm needs the smallest
cluster size to be larger than a constant multiple of
$\sqrt{mn \ln(m) \ln(n)}$.

While we obtain the asymptotic results for fixed $p$ and $\epsilon$, the
bounds we obtain in the process also apply to the case when
$p, \epsilon$ depend on $m, n$.

The paper is organized as follows. In Section II we describe
our model. The main results are stated and proved in Section
III. We conclude in Section IV.

II. Our Model and Notation 

Suppose $X$ is the unknown $m \times n$ rating matrix with entries
in $\{0, 1\}$, where $n$ is the number of buyers and $m$ is the number
of items. Let $\mathcal{A} = \{A_i\}_{i=1}^{r}$ and $\mathcal{B} = \{B_j\}_{j=1}^{t}$ be partitions
of $[1 : m]$ and $[1 : n]$ respectively. The sets $A_i \times B_j$ are the
clusters in the matrix $X$. We call the $A_i$'s ($B_j$'s) the row (column)
clusters. We denote the corresponding row and column cluster
sizes by $m_i$ and $n_j$, and the number of row clusters and
the number of column clusters by $r$ and $t$ respectively. (We
note that the $A_i$'s (respectively $B_j$'s) need not consist of
adjacent rows (respectively columns), and hence this notation
is different from that in the introduction.) The entries of
$X$ are passed through the cascade of a memoryless erasure
channel with erasure probability $\epsilon$ and a memoryless BSC
with error probability $p$. While the erasure channel models
the missing ratings, the BSC models noisy behavior of the
buyers. The output of the channel, i.e. the observed rating
matrix, is denoted by $Y$ and its entries are in $\{0, 1, e\}$, where
$e$ denotes an erasure. We analyze the probability of error for a
fixed rating matrix as well as the probability of error averaged
over the rating matrices. We use the following probability law
on the rating matrices. We assume that all row and column
clusters have the same size $m_0$ and $n_0$ respectively, and the
$rt$ constant blocks (of size $m_0 n_0$) contain i.i.d. Bernoulli 1/2
random variables.
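
As an illustration, the observation model of this section can be simulated with the following minimal Python sketch. The function names (`sample_block_constant`, `permute`, `observe`) and the use of `None` for the erasure symbol are assumptions made for this example only; they do not come from the paper.

```python
import random

ERASED = None  # assumption: None stands in for the erasure symbol e


def sample_block_constant(r, t, m0, n0, rng):
    """Draw an (r*m0) x (t*n0) matrix whose r*t constant blocks
    hold i.i.d. Bernoulli(1/2) values."""
    block_values = [[rng.randint(0, 1) for _ in range(t)] for _ in range(r)]
    return [[block_values[i // m0][j // n0] for j in range(t * n0)]
            for i in range(r * m0)]


def permute(matrix, rng):
    """Apply independent, uniformly random row and column permutations."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    rng.shuffle(rows)
    rng.shuffle(cols)
    return [[matrix[i][j] for j in cols] for i in rows]


def observe(matrix, p, eps, rng):
    """Erase each entry with probability eps, then flip each surviving
    entry with probability p (erasure channel followed by a BSC)."""
    observed = []
    for row in matrix:
        out_row = []
        for x in row:
            if rng.random() < eps:
                out_row.append(ERASED)
            elif rng.random() < p:
                out_row.append(1 - x)
            else:
                out_row.append(x)
        observed.append(out_row)
    return observed


rng = random.Random(0)
X = permute(sample_block_constant(r=3, t=4, m0=2, n0=5, rng=rng), rng)
Y = observe(X, p=0.1, eps=0.3, rng=rng)
```

Here `X` plays the role of the unknown rating matrix (with the clusters hidden by the permutations) and `Y` is the observed matrix with entries in {0, 1, erased}.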

III. Main Results 

In Section III-A we study the probability of error of
the maximum likelihood decoder when the clusters $\mathcal{A}, \mathcal{B}$ are
known. This result provides a lower bound on the cluster size
that ensures diminishing probability of error. In Section III-B
we analyze the probability of error in identifying the clusters
for a specific algorithm. These results are integrated in Section
III-C to obtain conditions on the cluster sizes for the overall
probability of error to diminish to zero.

A. Probability of Error When Clustering is Known 

In this section, we study the probability of error of the
maximum likelihood (ML) decoder for a given rating matrix $X$ when
$\mathcal{A}$ and $\mathcal{B}$ are known. We denote this probability by $P_{e|\mathcal{A},\mathcal{B}}(X)$.
We note that the ML decoder ignores the erasures, counts the
number of 0's and 1's in each cluster $A_i \times B_j$, and takes a
majority decision. Ties are resolved by tossing a fair coin. The
following theorem provides simple upper and lower bounds on
$P_{e|\mathcal{A},\mathcal{B}}(X)$.



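Theorem 1: Let $0 \le p < 1/2$, let $\epsilon$ denote the erasure probability, and let

$$p_1 = \epsilon + 2(1 - \epsilon)\sqrt{p(1-p)},$$

$$G(u) = 1 - \prod_{i=1,j=1}^{r,t} \left(1 - u^{m_i n_j}\right).$$

Then the probability of error of the ML decoder satisfies the
following bounds:

$$G(\epsilon) \le P_{e|\mathcal{A},\mathcal{B}}(X) \le G(p_1). \tag{1}$$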
Proof: We note that when $p = 0$, we make an error in
a cluster iff all the entries in the cluster are erased. Since the
erasures in different clusters are independent, it follows that
$P_{e|\mathcal{A},\mathcal{B}}(X) = G(\epsilon)$ for $p = 0$. This gives the lower bound on
$P_{e|\mathcal{A},\mathcal{B}}(X)$ for $p > 0$.

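Next we prove the upper bound. Suppose in cluster $A_i \times B_j$
we have $s$ non-erased samples, and let $E_{ij,s}$ denote the event of a
correct decision in this cluster. Then the probability of correct
decision in this cluster is given by

$$\Pr(E_{ij,s}) = \begin{cases} \displaystyle\sum_{q=0}^{(s-1)/2} \binom{s}{q} p^q (1-p)^{s-q} & \text{if } s \text{ is odd,} \\[2ex] \displaystyle\sum_{q=0}^{s/2-1} \binom{s}{q} p^q (1-p)^{s-q} + \frac{1}{2}\binom{s}{s/2} p^{s/2}(1-p)^{s/2} & \text{if } s \text{ is even.} \end{cases} \tag{2}$$

Averaging over the number of non-erased samples, the probability
of correct decision in cluster $A_i \times B_j$ is given by

$$\Pr(E_{ij}) = \sum_{s=0}^{m_i n_j} \binom{m_i n_j}{s} \epsilon^{m_i n_j - s} (1-\epsilon)^{s} \Pr(E_{ij,s}). \tag{3}$$

Since the erasure channel and the BSC are memoryless,

$$P_{e|\mathcal{A},\mathcal{B}}(X) = \Pr\left(\bigcup_{i=1,j=1}^{r,t} E_{ij}^{c}\right) = 1 - \prod_{i=1,j=1}^{r,t} \Pr(E_{ij}). \tag{4}$$

Equations (2), (3), and (4) specify the probability of error. The
desired upper bound is obtained by deriving an upper bound
on $\Pr(E_{ij,s}^{c})$. First we note that from (2),

$$1 - \Pr(E_{ij,s}) \le \sum_{q=\lceil s/2 \rceil}^{s} \binom{s}{q} p^q (1-p)^{s-q}.$$

But for $0 \le p < 1/2$ and $q \ge s/2$, $p^q (1-p)^{s-q} \le p^{s/2}(1-p)^{s/2}$.
Substituting this in the previous equation and using
$\sum_{q} \binom{s}{q} \le 2^s$, we have

$$\Pr(E_{ij,s}) \ge 1 - \left(2\sqrt{p(1-p)}\right)^{s}. \tag{5}$$

From Equations (3) and (5), we have $\Pr(E_{ij}) \ge 1 - p_1^{m_i n_j}$, and
so from (4), $P_{e|\mathcal{A},\mathcal{B}}(X) \le G(p_1)$. This completes the proof
for the upper bound on $P_{e|\mathcal{A},\mathcal{B}}(X)$. ■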
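
The ML decoder for known clusters and the two bounds of Theorem 1 can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, erased entries are assumed to be encoded as `None`, and the cluster sizes are passed as two lists `row_sizes` ($m_i$) and `col_sizes` ($n_j$).

```python
import math
import random


def ml_decode_cluster(entries, rng):
    """Majority vote over the non-erased entries of one cluster
    (None = erased); ties are broken by a fair coin."""
    ones = sum(1 for y in entries if y == 1)
    zeros = sum(1 for y in entries if y == 0)
    if ones == zeros:
        return rng.randint(0, 1)
    return 1 if ones > zeros else 0


def G(u, row_sizes, col_sizes):
    """G(u) = 1 - prod_{i,j} (1 - u^(m_i * n_j))."""
    prod = 1.0
    for mi in row_sizes:
        for nj in col_sizes:
            prod *= 1.0 - u ** (mi * nj)
    return 1.0 - prod


def theorem1_bounds(row_sizes, col_sizes, p, eps):
    """Return (G(eps), G(p1)) with p1 = eps + 2(1 - eps) sqrt(p(1 - p))."""
    p1 = eps + 2.0 * (1.0 - eps) * math.sqrt(p * (1.0 - p))
    return G(eps, row_sizes, col_sizes), G(p1, row_sizes, col_sizes)


rng = random.Random(0)
decision = ml_decode_cluster([1, 1, 0, None], rng)
lower, upper = theorem1_bounds([2, 2], [3, 3], p=0.1, eps=0.5)
```

For $p = 0$ the two values returned by `theorem1_bounds` coincide, since then $p_1 = \epsilon$.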

Let us define the smallest cluster size as

$$s_*(X) := \min_{i,j} m_i n_j, \tag{6}$$

and the largest cluster size as

$$s^*(X) := \max_{i,j} m_i n_j.$$

The following corollary gives simpler bounds on $P_{e|\mathcal{A},\mathcal{B}}(X)$.

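Corollary 1: Let $N_X(s)$ be the number of clusters in $X$
with exactly $s$ elements. Let

$$s_*(X) \ge \frac{\ln(2)}{\ln(1/p_1)}.$$

Then

$$1 - \exp\left(-\sum_{s=1}^{\infty} N_X(s)\,\epsilon^{s}\right) \le P_{e|\mathcal{A},\mathcal{B}}(X) \le 1 - \exp\left(-2\ln(2)\sum_{s=1}^{\infty} N_X(s)\,p_1^{s}\right). \tag{7}$$

In particular,

$$1 - \exp\left(-\frac{mn\,\epsilon^{s^*(X)}}{s^*(X)}\right) \le P_{e|\mathcal{A},\mathcal{B}}(X) \le 1 - \exp\left(-\frac{2\ln(2)\,mn\,p_1^{s_*(X)}}{s_*(X)}\right). \tag{8}$$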

Proof: The proof is based on upper and lower bounds for
$G(u)$. We note that $(1 - x) \le \exp(-x)$, and for $x \in [0, 1/2]$,
$1 - x \ge \exp(-2\ln(2)x)$. Hence

$$\exp\left(-2\ln(2)\sum_{i=1,j=1}^{r,t} u^{m_i n_j}\right) \le \prod_{i=1,j=1}^{r,t}\left(1 - u^{m_i n_j}\right) \le \exp\left(-\sum_{i=1,j=1}^{r,t} u^{m_i n_j}\right),$$

where the first inequality holds for $u^{m_i n_j} \le 1/2$. The sum in
the exponent can be written in terms of the sizes of the clusters:

$$\sum_{i=1,j=1}^{r,t} u^{m_i n_j} = \sum_{s=1}^{\infty} N_X(s)\,u^{s}.$$

The bounds (7) now follow from Theorem 1 by noting that
$p_1^{m_i n_j} \le 1/2$ for $s_*(X) \ge \ln(2)/\ln(1/p_1)$.

To prove (8), we note that

$$\sum_{s=1}^{\infty} N_X(s)\,u^{s} \le rt\,u^{s_*(X)} \le \frac{mn}{s_*(X)}\,u^{s_*(X)}.$$

This gives the upper bound in (8). The lower bound in (8)
follows similarly. ■
We are interested in studying the cluster sizes that guarantee
correct decisions asymptotically. Though (7) is tighter than
(8), the conditions arising out of (8) are cleaner and are stated
below.

Corollary 2: Suppose we are given a sequence of rating
matrices of increasing size, that is, $mn \to \infty$. Then the
following are true.

1) If

$$s_*(X) \ge \frac{\ln(mn)}{\ln(1/p_1)},$$

then $P_{e|\mathcal{A},\mathcal{B}}(X) \to 0$.

2) If

$$s^*(X) \le \frac{(1 - \delta)\ln(mn)}{\ln(1/\epsilon)}, \quad \text{for some } \delta > 0,$$

then $P_{e|\mathcal{A},\mathcal{B}}(X) \to 1$.
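Proof: First consider Part 1. From (8), using $e^{-x} \ge 1 - x$
we get

$$P_{e|\mathcal{A},\mathcal{B}}(X) \le \frac{2\ln(2)\,mn\,p_1^{s_*(X)}}{s_*(X)}.$$

The RHS is a decreasing function of $s_*(X)$, and hence substituting
the lower bound on $s_*(X)$ we get

$$P_{e|\mathcal{A},\mathcal{B}}(X) \le \frac{2\ln(2)\ln(1/p_1)}{\ln(mn)} \to 0.$$

For Part 2, we note that $1 - \exp\left(-mn\,\epsilon^{s^*(X)}/s^*(X)\right)$ is a
decreasing function of $s^*(X)$, and hence substituting the upper
bound, we have from (8)

$$P_{e|\mathcal{A},\mathcal{B}}(X) \ge 1 - \exp\left(-\frac{\ln(1/\epsilon)\,(mn)^{\delta}}{(1 - \delta)\ln(mn)}\right).$$

But since $(mn)^{\delta}/\ln(mn) \to \infty$, we have $P_{e|\mathcal{A},\mathcal{B}}(X) \to 1$. ■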
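
To make the two regimes of Corollary 2 concrete, the following Python sketch evaluates the bounds in (8) and the two threshold cluster sizes. It is a numerical illustration under stated assumptions: the function names are hypothetical, $0 < \epsilon < 1$ and $0 \le p < 1/2$ are assumed, and the upper bound is only meaningful when $s_*(X) \ge \ln(2)/\ln(1/p_1)$, as required by Corollary 1.

```python
import math


def p1_of(p, eps):
    """p1 = eps + 2(1 - eps) sqrt(p(1 - p)), as in Theorem 1."""
    return eps + 2.0 * (1.0 - eps) * math.sqrt(p * (1.0 - p))


def eq8_bounds(m, n, s_min, s_max, p, eps):
    """Lower and upper bounds of (8) for smallest / largest cluster
    sizes s_min and s_max. -expm1(-x) computes 1 - exp(-x) accurately."""
    lower = -math.expm1(-m * n * eps ** s_max / s_max)
    upper = -math.expm1(
        -2.0 * math.log(2.0) * m * n * p1_of(p, eps) ** s_min / s_min)
    return lower, upper


def corollary2_thresholds(m, n, p, eps, delta):
    """Cluster sizes below `below` force error -> 1 (Part 2);
    cluster sizes above `above` give error -> 0 (Part 1)."""
    below = (1.0 - delta) * math.log(m * n) / math.log(1.0 / eps)
    above = math.log(m * n) / math.log(1.0 / p1_of(p, eps))
    return below, above


below, above = corollary2_thresholds(1000, 1000, p=0.1, eps=0.5, delta=0.5)
small_lower, _ = eq8_bounds(1000, 1000, s_min=4, s_max=4, p=0.1, eps=0.5)
_, large_upper = eq8_bounds(1000, 1000, s_min=100, s_max=100, p=0.1, eps=0.5)
```

With these example parameters, a uniform cluster size of 4 lies below the first threshold and a uniform cluster size of 100 lies above the second, so `small_lower` is close to one and `large_upper` is close to zero.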

B. Probability of Error in Clustering 

Data mining researchers have developed several techniques 
for clustering data; see for example [8, Chapter 4]. In this 
section, we analyze a simple polynomial time clustering algo- 
rithm. The algorithm clusters rows and columns separately. To 
cluster rows, we compute the normalized Hamming distance 
between two rows over commonly sampled entries. For rows 
$i, j$, this distance is:

$$d_{ij} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{1}(Y_{ik} \ne e,\ Y_{jk} \ne e)\,\mathbf{1}(Y_{ik} \ne Y_{jk}).$$

If this is less than a threshold $d_0$, then the two rows are
declared to be in the same cluster, and otherwise they are
declared to be in different clusters. We apply this process to
all pairs of rows and all pairs of columns. Let $I_{ij}$ be equal
to 1 if rows $i, j$ belong to the same cluster and let it be 0
otherwise. The algorithm gives an estimate:

$$\hat{I}_{ij} = \begin{cases} 1, & d_{ij} < d_0, \\ 0, & d_{ij} \ge d_0. \end{cases}$$

We are interested in the probability that we make an error in
row clustering, averaged over the probability law on the rating
matrices described in Section II:

$$P_{e,rc} = \Pr\left(\hat{I}_{ij} \ne I_{ij} \text{ for some } i, j\right).$$

Once the rows are clustered, we can apply the same procedure
to cluster columns. Below we analyze the error probability
$P_{e,rc}$; the probability of error in finding column clusters has
similar behavior.
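
The pairwise test described above can be sketched in Python as follows. The names `row_distance`, `pairwise_same_cluster`, and `group_rows` are hypothetical, and erasures are assumed to be encoded as `None`. The last step, which turns pairwise decisions into cluster labels by taking connected components, is an assumption of this sketch; the text only specifies the pairwise decisions.

```python
def row_distance(row_i, row_j):
    """Normalized Hamming distance d_ij over commonly sampled entries.
    Erased entries are None; the sum is normalized by the row length n."""
    disagreements = sum(
        1 for a, b in zip(row_i, row_j)
        if a is not None and b is not None and a != b
    )
    return disagreements / len(row_i)


def pairwise_same_cluster(Y, d0):
    """Matrix of estimates: entry (i, j) is 1 if d_ij < d0, else 0."""
    m = len(Y)
    return [[1 if row_distance(Y[i], Y[j]) < d0 else 0 for j in range(m)]
            for i in range(m)]


def group_rows(I_hat):
    """Assumption (not specified in the text): label rows by connected
    components of the graph whose edges are the 'same cluster' decisions."""
    m = len(I_hat)
    labels = [-1] * m
    next_label = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(m):
                if I_hat[i][j] == 1 and labels[j] == -1:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


Y = [[0, 0, 1, 1],
     [0, None, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, None]]
labels = group_rows(pairwise_same_cluster(Y, d0=1 / 3))
```

Columns can be clustered with the same functions applied to the transpose of `Y`.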



Theorem 2: Suppose we are given a sequence of rating
matrices with $n \to \infty$ and $t_n$ column clusters, such that
$\limsup_{n \to \infty} m/n < \infty$. Let

$$\mu := 2p(1-p)(1-\epsilon)^2, \qquad \delta := (1-\epsilon)^2(1-2p)^2,$$

and choose $d_0 = \mu + \delta/3$. Then there exists a positive constant
$C_0$ such that if $t_n > C_0 \ln(n)$, then $P_{e,rc} \to 0$.

Proof: We start by considering the choice of the threshold.
When $i, j$ are in the same cluster,

$$E[d_{ij} \mid I_{ij} = 1, X] = 2p(1-p)(1-\epsilon)^2 = \mu.$$

When $i, j$ are in different clusters, let $s_{ij}$ be the number of
columns in which $i, j$ disagree. Then

$$E[d_{ij} \mid I_{ij} = 0, X] = \frac{(1-\epsilon)^2}{n}\left[\left(p^2 + (1-p)^2\right)s_{ij} + 2p(1-p)(n - s_{ij})\right].$$

We choose

$$d_0 = \mu + \delta\,\frac{a_n}{n},$$

where $a_n$ is chosen below to obtain diminishing probability
of error.

First we bound the probability of error when $I_{ij} = 1$. We
note that in this case $d_{ij}$ is the average of $n$ i.i.d. Bernoulli
random variables with mean $\mu = 2p(1-p)(1-\epsilon)^2$. Hence

$$\Pr\left(\hat{I}_{ij} \ne 1 \mid I_{ij} = 1, X\right) = \Pr\left(d_{ij} - \mu \ge \frac{a_n}{n}\delta \,\Big|\, I_{ij} = 1, X\right) \le \exp\left(-\frac{\delta^2 a_n^2}{3\mu n}\right), \tag{9}$$

where in the last step we have used the Chernoff bound [9,
Theorem 4.4, pp. 64].

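Next consider the case $I_{ij} = 0$. In this case, $d_{ij}$ is the
average of $n - s_{ij}$ identically distributed Bernoulli random
variables with mean $\mu$ and $s_{ij}$ identically distributed Bernoulli
random variables with mean $\nu = (1-\epsilon)^2[p^2 + (1-p)^2]$, all
the random variables being independent. So for $\theta \le 0$ we have

$$\Pr\left(\hat{I}_{ij} \ne 0 \mid I_{ij} = 0, X\right) \le \frac{\left(1 - \mu + \mu e^{\theta}\right)^{n - s_{ij}}\left(1 - \nu + \nu e^{\theta}\right)^{s_{ij}}}{e^{n d_0 \theta}} \tag{10}$$

$$\le \exp\left(n\beta_{ij}\left(e^{\theta} - 1\right) - n d_0 \theta\right), \qquad \beta_{ij} := \mu + \delta\,\frac{s_{ij}}{n}, \tag{11}$$

where in (10) we have used the Chernoff bound and in (11)
we have used the inequality $1 + x \le \exp(x)$. Choosing $\theta =
\min(0, \ln(d_0/\beta_{ij}))$ (which is the optimal choice), we have

$$\Pr\left(\hat{I}_{ij} \ne 0 \mid I_{ij} = 0, X\right) \le \begin{cases} \exp\left(n(d_0 - \beta_{ij}) + n d_0 \ln\dfrac{\beta_{ij}}{d_0}\right) & \text{if } s_{ij} > a_n, \\[2ex] 1 & \text{if } s_{ij} \le a_n. \end{cases} \tag{12}$$

Note that for $s_{ij} > a_n$, we have $0 < (\beta_{ij} - d_0)/d_0 < 1$, and

$$\ln\frac{\beta_{ij}}{d_0} \le \frac{\beta_{ij} - d_0}{d_0} - \frac{1}{6}\left(\frac{\beta_{ij} - d_0}{d_0}\right)^2.$$

Substituting in (12), if $s_{ij} > a_n$, then

$$\Pr\left(\hat{I}_{ij} \ne 0 \mid I_{ij} = 0, X\right) \le \exp\left(-\frac{\delta^2 (s_{ij} - a_n)^2}{6(n\mu + \delta a_n)}\right). \tag{13}$$

Taking expectation in (12) and using (13), we get

$$\Pr\left(\hat{I}_{ij} \ne 0 \mid I_{ij} = 0\right) \le \Pr(s_{ij} \le a_n) + E\left[\exp\left(-\frac{\delta^2 (s_{ij} - a_n)^2}{6(n\mu + \delta a_n)}\right)\right] =: T_1 + T_2.$$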

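We note that $s_{ij} = n_0 X$, where $X$ here denotes a Binomial($t_n$, 1/2)
random variable. Thus $E[s_{ij}] = n_0 t_n/2 = n/2$ and $\mathrm{var}\{s_{ij}\} = n n_0/4$.
Thus if $n_0 = o(n)$, then $s_{ij}$ concentrates around its mean. Hence to get a
diminishing $T_1$, we choose $a_n = n/3$. Then

$$T_1 = P\left(s_{ij} \le \frac{n}{3}\right) = P\left(X \le \frac{t_n}{3}\right) \le P\left(|X - t_n/2| \ge \frac{t_n}{6}\right) \le 2\exp\left(-\frac{t_n}{54}\right), \tag{14}$$

where we have used the Chernoff bound [9, Corollary 4.6, pp.
67].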

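Substituting for $a_n$ in $T_2$, we see that for a suitable positive
constant $c$,

$$T_2 = E\left[\exp\left(-cn\,\frac{(X - t_n/3)^2}{t_n^2}\right)\right] = \sum_{s=0}^{t_n} \binom{t_n}{s} 2^{-t_n} \exp\left(-cn\,\frac{(s - t_n/3)^2}{t_n^2}\right)$$

$$= \sum_{|s - t_n/3| > t_n/9} \binom{t_n}{s} 2^{-t_n} \exp\left(-cn\,\frac{(s - t_n/3)^2}{t_n^2}\right) + \sum_{|s - t_n/3| \le t_n/9} \binom{t_n}{s} 2^{-t_n} \exp\left(-cn\,\frac{(s - t_n/3)^2}{t_n^2}\right)$$

$$\le \exp(-cn/81) + \sum_{|s - t_n/3| \le t_n/9} 2^{-t_n}\,2^{t_n h(s/t_n)}$$

$$\le \exp(-cn/81) + t_n\,2^{-t_n(1 - h(4/9))}, \tag{15}$$

where $h(\cdot)$ denotes the binary entropy function. From (14) and (15), it follows that

$$\Pr\left(\hat{I}_{ij} \ne 0 \mid I_{ij} = 0\right) \le T_1 + T_2 \le n c_1 \exp(-c_2 t_n), \tag{16}$$

where $c_1, c_2$ are positive constants. Since there are only
$m(m-1)/2$ pairs of rows, the desired result follows. ■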
Remark: If we consider the probability of error in clustering
for a fixed rating matrix, then to get diminishing probability
of error asymptotically, we need

$$m_0 n_0 > C\sqrt{mn \ln(m) \ln(n)}.$$



C. Estimation Under Unknown Clustering 

In this section, we consider our full problem - estimation of 
the underlying rating matrix from noisy, sparse observations 
when clustering is not known. Our result is the following. 

Theorem 3: Consider the collection of block constant matrices
with the probability law described in Section II. Let
$m = \beta n$, $\beta > 0$ fixed. Then there exist constants $C_i$, $1 \le i \le 4$,
such that the following holds for $t > C_3 \ln(n)$, $r > C_4 \ln(m)$.

1) If $m_0 n_0 < C_1 \ln(mn)$, then for any estimator of $X$,
$P_e \to 1$ as $n \to \infty$.

2) Consider an estimator which first clusters the rows and
columns using the algorithm described in Section III-B
and then uses ML decoding as in Section III-A, assuming
that the clustering is correct. If $m_0 n_0 > C_2 \ln(mn)$, then
for this algorithm $P_e \to 0$ as $n \to \infty$.

Proof: When $\mathcal{A}, \mathcal{B}$ are known, then under our model
all feasible rating matrices are equally likely. Hence the ML
decoder gives the minimum probability of error, and so we
have $P_e \ge E[P_{e|\mathcal{A},\mathcal{B}}(X)]$. To prove Part 1), we next lower
bound $E[P_{e|\mathcal{A},\mathcal{B}}(X)]$. Let $T$ be the event that $s^*(X) > m_0 n_0$.
We note that $X \in T$ iff for some pair of row clusters all the
$t$ column clusters have been generated equal, or for some pair
of column clusters all the $r$ row clusters have been generated equal.
Using the union bound, we get that

$$\Pr(T) \le \frac{\binom{r}{2}}{2^{t}} + \frac{\binom{t}{2}}{2^{r}} \le m^2 2^{-t} + n^2 2^{-r}. \tag{17}$$

We choose $C_3, C_4$ to ensure that the above bound decays to
zero, and hence $\Pr(T) \to 0$. Now,

$$E[P_{e|\mathcal{A},\mathcal{B}}(X)] \ge E[P_{e|\mathcal{A},\mathcal{B}}(X);\ T^{c}].$$

But on the event $T^{c}$, $s^*(X) = m_0 n_0$, and from the lower
bound in (8) we get

$$P_e \ge E[P_{e|\mathcal{A},\mathcal{B}}(X)] \ge (1 - \Pr(T))\left[1 - \exp\left(-\frac{\ln(1/\epsilon)\,(mn)^{\delta}}{(1 - \delta)\ln(mn)}\right)\right], \tag{18}$$

which $\to 1$ as $mn \to \infty$. This proves Part 1).

Next we prove Part 2). Let $D$ denote the event that the
clustering is identified correctly. We note that the probability
of error in estimating $X$, averaged over the probability law on
the block constant matrices, satisfies

$$P_e \le E\left[P_{e|\mathcal{A},\mathcal{B}}(X)\Pr(D) + \Pr(D^{c})\right] \le E\left[P_{e|\mathcal{A},\mathcal{B}}(X)\right] + (P_{e,rc} + P_{e,cc}),$$

where $P_{e,cc}$ is the probability of error in column clustering.
The desired result follows from Part 1) of Corollary 2 and
Theorem 2. ■
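
The two-stage estimator of Part 2) can be sketched end to end in Python as follows. This is a minimal sketch under several assumptions that are not in the text: the function names are hypothetical, erasures are encoded as `None`, the threshold is taken as $d_0 = \mu + \delta a_n/n$ with $a_n = n/3$ (the choice made in the proof of Theorem 2) for both rows and columns, and pairwise decisions are turned into clusters by comparing each row with one representative per cluster found so far.

```python
import random


def distance(u, v):
    """Normalized Hamming distance over commonly sampled entries."""
    return sum(1 for a, b in zip(u, v)
               if a is not None and b is not None and a != b) / len(u)


def cluster(vectors, d0):
    """Assumption of this sketch: a vector joins the first cluster whose
    representative (its first member) is at distance < d0, else it
    starts a new cluster."""
    reps, labels = [], []
    for v in vectors:
        for c, rep in enumerate(reps):
            if distance(v, rep) < d0:
                labels.append(c)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels


def majority(values, rng):
    """ML decision for one cluster: majority vote, fair coin on ties."""
    ones = sum(1 for y in values if y == 1)
    zeros = sum(1 for y in values if y == 0)
    if ones == zeros:
        return rng.randint(0, 1)
    return 1 if ones > zeros else 0


def estimate(Y, p, eps, rng):
    """Cluster rows and columns, then majority-decode every block."""
    mu = 2.0 * p * (1.0 - p) * (1.0 - eps) ** 2
    delta = (1.0 - eps) ** 2 * (1.0 - 2.0 * p) ** 2
    d0 = mu + delta / 3.0
    row_labels = cluster(Y, d0)
    col_labels = cluster([list(col) for col in zip(*Y)], d0)
    votes = {}
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            votes.setdefault((row_labels[i], col_labels[j]), []).append(y)
    value = {key: majority(vals, rng) for key, vals in votes.items()}
    return [[value[(row_labels[i], col_labels[j])]
             for j in range(len(Y[0]))] for i in range(len(Y))]


rng = random.Random(0)
# A 6 x 8 block constant matrix with 2 x 2 blocks of size 3 x 4,
# rows shuffled so that the row clusters are not contiguous.
X = [[(i // 3 + j // 4) % 2 for j in range(8)] for i in range(6)]
order = list(range(6))
rng.shuffle(order)
X = [X[i] for i in order]
X_hat = estimate(X, p=0.0, eps=0.0, rng=rng)
```

In this noiseless toy case the estimate coincides with `X`; with noise and erasures, exact recovery depends on the cluster sizes as described by Theorem 3.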



Remark: The above result states that for fixed $p, \epsilon$, the
smallest cluster size that leads to zero error asymptotically is
$\Theta(\ln(mn)) = \Theta(\ln(n))$. When $p = 0$, then we can also apply
the method in [6] to our model, and this yields a smallest
cluster size of $\Theta(n^{1/2}(\ln(n))^2)$, which is strictly worse than
our result.

Remark: In [7], the focus is on rating matrices of rank $O(1)$
and $\epsilon = c/n$, which leads to $\Theta(n)$ observations. For our
model, $O(1)$ rank corresponds to a cluster size of $\Theta(mn)$,
and for $\epsilon = c/n$, our algorithm can be seen to give zero error
asymptotically for any fixed rating matrix.

IV. Conclusion 

We considered the problem of estimating a block constant 
rating matrix. The observed matrix is obtained through un- 
known relabeling of the rows and columns of the underlying 
matrix, followed by an error and erasure channel. Our probability
of error analysis showed that if the number of row
clusters and the number of column clusters are $\Omega(\ln(m))$ and
$\Omega(\ln(n))$ respectively, then the matrix can be clustered and
estimated with vanishing probability of error if the cluster sizes
are $\Omega(\ln(mn))$.